1. Introduction


CARE III (Computer-Aided Reliability Estimation, version
three) is a computer program designed to help estimate the
reliability of complex, redundant systems. Although the program
can model a wide variety of redundant structures, it was developed
specifically for fault-tolerant avionics systems: systems
distinguished by the need for extremely reliable performance, since a
system failure could well result in the loss of human life.

It is usually relatively easy to design enough redundancy
into a system to reduce to acceptably small levels the probability
that it fails due to inadequate resources. The dominant cause of
failure in ultra-reliable systems thus tends to be not the
exhaustion of resources but rather the failure to detect and
isolate a malfunctioning element before it has caused the system
to take an erroneous action. Such failures are called coverage
failures. CARE III differs from its predecessors in, among other
things, the attention given to coverage failure mechanisms.

The first CARE program, developed at the Jet Propulsion
Laboratory in 1971, provided an aid for estimating the reliability
of systems consisting of a combination of any of several standard
configurations (e.g., standby-replacement configurations,
triple-modular redundant configurations, etc.). CARE II was subsequently
developed by Raytheon, under contract to the NASA Langley Research
Center, in 1974. It substantially generalized the class of
redundant configurations that could be accommodated, and included a
coverage model to determine the various coverage probabilities as
a function of the applicable fault recovery mechanisms (detection
delay, diagnostic scheduling interval, isolation and recovery
delay, etc.).




CARE III further generalizes the class of system structures
that can be modeled and greatly expands the coverage model to
take into account such effects as intermittent and transient
faults, latent faults, error propagation, etc. In order to
accomplish this, it was necessary to depart substantially from the
approaches taken in previous reliability modeling efforts. The
nature of, and the reasons for, this departure are explained in
the following section.
2. Background


Reliability models tend to fall into one of two classes:
combinatorial or Markov. Combinatorial models attempt to
categorize the set of operational states (or, conversely, the number
of non-operational states) of the system in terms of the
functional states of its components in such a way that the probabilities
of each of these states can be determined by combinatorial means.
Markov models concentrate on the rate at which transitions take
place between different system states and then use this
information to determine the probabilities that the system is in each of
these states at any given time. These two approaches, and the CARE
III departure, are best illustrated by an example.

Consider a simple, redundant structure consisting of four 
identical elements, the (binary) outputs of which are passed 
through a majority voter. If the outputs of at least three of 
these units are correct, the voter output is likewise correct. 
Further, if any one unit is determined to be faulty, its outputs 
are subsequently ignored by the voter, so that a second failure 
can also be tolerated without producing an incorrect output. 

First, assume the voter is perfect both in its ability to produce
an output determined by the majority of its inputs and in its
ability to identify and to ignore without further delay the
outputs of the first faulty element.

The combinatorial method for assessing the reliability of
such a structure is entirely straightforward: the probability
that the output is correct is simply the probability that at most
two of the four elements have failed. If any single element has
a probability P(t) of surviving until time t, the probability R(t)
that the voter outputs are still correct at time t is therefore

$$R(t) = \sum_{i=0}^{2} \binom{4}{i} [P(t)]^{4-i} [1-P(t)]^{i} = 6P^{2}(t) - 8P^{3}(t) + 3P^{4}(t) \qquad (1)$$
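
Equation 1 is easy to check numerically. The following is a minimal sketch, not part of CARE III itself; the function names are illustrative only. It evaluates the binomial sum and the expanded polynomial side by side.

```python
from math import comb


def reliability_perfect_coverage(p: float) -> float:
    """Equation 1 as a sum: at most two of the four elements have failed."""
    return sum(comb(4, i) * p ** (4 - i) * (1 - p) ** i for i in range(3))


def reliability_polynomial(p: float) -> float:
    """Equation 1 in expanded form."""
    return 6 * p**2 - 8 * p**3 + 3 * p**4


if __name__ == "__main__":
    # p is the survival probability P(t) of a single element.
    for p in (0.5, 0.9, 0.999):
        print(p, reliability_perfect_coverage(p), reliability_polynomial(p))
```

Both forms agree to rounding error for any p between 0 and 1.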


The Markov model of the structure in question is equally
straightforward. In general, a structure can be represented by
a Markov model if it is possible to characterize it in terms of
states (the various states defined, for example, by the number of
component failures and other relevant parameters) and transition
rates between states, with the proviso that the transition rate
$r_{ij}(t)$ between state $S_i$ and state $S_j$ is, for all i and j, a
function only of i and j and, possibly, the time t measured from the
entry into some known initial state (cf. Figure 1). Thus, if the
system is known to be in state $S_i$ at time $\tau$, the probability $S_i(t)$
that it has not left that state by time $t \ge \tau$ is given by the
solution to the differential equation

$$-S_i'(t) = \sum_{j \ne i} r_{ij}(t)\, S_i(t), \qquad t \ge \tau$$

with the initial condition $S_i(\tau) = 1$.

If the transition rates $r_{ij}(t)$ are all independent of t, the
Markov model is said to be (time) homogeneous. In this case, the
differential equation is readily solved, yielding

$$S_i(t) = e^{-\lambda_i (t-\tau)}, \qquad t \ge \tau$$

with $\lambda_i = \sum_{j \ne i} r_{ij}$. The holding time in each state, in this case, is
exponentially distributed.

[Figure 1. General Structure of a Markov Model]

Consequently, if in the structure of concern here, the
probability P(t) that any single element survives until time t is
exponentially distributed ($P(t) = e^{-\lambda t}$), and if state $S_i$ refers to
the state of the system characterized by i component failures,
then the distribution of the holding time in state i is just
$S_i(t) = e^{-(4-i)\lambda t}$, with 4 the number of initially operational elements
and $\lambda$ the hazard rate of each element. The transition rate $r_{ij}(t)$
is then simply

$$r_{ij}(t) = -\frac{S_i'(t)}{S_i(t)} = \begin{cases} (4-i)\lambda, & j = i+1 \\ 0, & j \ne i+1 \end{cases}$$

and the Markov model is as shown in Figure 2. The three states
labeled 0, 1, and 2 correspond to the number of failed elements;
the state labeled F denotes the failed state (more than two failed
elements).

[Figure 2. Markov Model of a 2-Out-of-4 Structure]

The reliability of the structure is also easy to determine
from its Markov model. Let $P_i(t)$ be the probability that the
system is in state i at time t. Then

$$\begin{aligned}
P_0'(t) &= -4\lambda P_0(t) \\
P_1'(t) &= 4\lambda P_0(t) - 3\lambda P_1(t) \\
P_2'(t) &= 3\lambda P_1(t) - 2\lambda P_2(t) \\
P_F'(t) &= 2\lambda P_2(t)
\end{aligned} \qquad (2)$$
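This set of linear, first-order differential equations can be
solved by conventional methods to yield

$$\begin{aligned}
P_0(t) &= e^{-4\lambda t} \\
P_1(t) &= 4e^{-3\lambda t}\left(1 - e^{-\lambda t}\right) \\
P_2(t) &= 6e^{-2\lambda t}\left(1 - e^{-\lambda t}\right)^{2} \\
P_F(t) &= 1 - P_0(t) - P_1(t) - P_2(t)
\end{aligned} \qquad (3)$$

so that

$$R(t) = 1 - P_F(t) = 6e^{-2\lambda t} - 8e^{-3\lambda t} + 3e^{-4\lambda t} \qquad (4)$$

as before.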
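
The agreement between the Markov model and the combinatorial result can also be confirmed numerically. The sketch below integrates the state equations (equation 2) with a classical fourth-order Runge-Kutta scheme and compares the outcome with equation 4. The integrator, step count, and parameter values are illustrative choices made here, not anything prescribed by CARE III.

```python
import math


def derivatives(p, lam):
    """Right-hand side of equation 2 for the state vector (P0, P1, P2, PF)."""
    p0, p1, p2, _pf = p
    return (
        -4 * lam * p0,
        4 * lam * p0 - 3 * lam * p1,
        3 * lam * p1 - 2 * lam * p2,
        2 * lam * p2,
    )


def integrate(lam, t_end, steps=1000):
    """Runge-Kutta integration from the initial condition P(0) = (1, 0, 0, 0)."""
    h = t_end / steps
    p = (1.0, 0.0, 0.0, 0.0)
    for _ in range(steps):
        k1 = derivatives(p, lam)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam)
        p = tuple(
            x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p


def reliability_closed_form(lam, t):
    """Equation 4."""
    return (6 * math.exp(-2 * lam * t)
            - 8 * math.exp(-3 * lam * t)
            + 3 * math.exp(-4 * lam * t))


if __name__ == "__main__":
    lam, t = 1e-4, 10.0  # illustrative hazard rate and mission time
    p0, p1, p2, pf = integrate(lam, t)
    print("numerical   R(t) =", 1.0 - pf)
    print("closed form R(t) =", reliability_closed_form(lam, t))
```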


The analysis so far has assumed perfect coverage. In particular,
it has been assumed that the first faulty element is correctly
identified with probability 1. Suppose, instead, that it is
correctly identified with probability C; i.e., with probability 1-C
the outputs of the first failed element are not ignored by the
voter. Then with probability 1-C, a second failure will cause the
voter to accept two erroneous inputs and hence to produce an
unreliable output. The system reliability can be determined
combinatorially by observing that the system will function properly if at
time t it has sustained no more than one element failure or, with
probability C, if it has sustained exactly two element failures.
Thus,

$$R(t) = \sum_{i=0}^{1} \binom{4}{i} [P(t)]^{4-i}[1-P(t)]^{i} + 6C\,[P(t)]^{2}[1-P(t)]^{2} = R^*(t) - 6(1-C)\,[P(t)]^{2}[1-P(t)]^{2} \qquad (5)$$

with R*(t) the perfect-coverage reliability as given in equation 1.
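
To see how strongly a constant coverage probability affects the result, equation 5 can be evaluated directly. This is a small sketch with an illustrative function name; the sample values of P(t) and C are arbitrary.

```python
def reliability_constant_coverage(p: float, c: float) -> float:
    """Equation 5: R*(t) reduced by the uncovered two-failure term."""
    perfect = 6 * p**2 - 8 * p**3 + 3 * p**4  # R*(t) from equation 1
    return perfect - 6 * (1 - c) * p**2 * (1 - p) ** 2


if __name__ == "__main__":
    p = 0.999  # illustrative element survival probability
    for c in (1.0, 0.99, 0.9, 0.0):
        print("C =", c, " 1 - R =", 1.0 - reliability_constant_coverage(p, c))
```

With C = 1 the expression reduces to equation 1, and with C = 0 it reduces to the probability of at most one failure.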

The Markov model of Figure 2 needs only to be modified as
shown in Figure 3 to account for this imperfect coverage effect.

[Figure 3. Markov Model of a 2-Out-of-4 Structure With Imperfect Coverage]

An analysis virtually identical to that of the previous Markov
model establishes that

$$\begin{aligned}
P_0(t) &= e^{-4\lambda t} \\
P_1(t) &= 4e^{-3\lambda t}\left(1 - e^{-\lambda t}\right) \\
P_2(t) &= 6C\,e^{-2\lambda t}\left(1 - e^{-\lambda t}\right)^{2} \\
P_F(t) &= 1 - P_0(t) - P_1(t) - P_2(t)
\end{aligned} \qquad (6)$$

so that, again, the combinatorial model and the Markov model yield
identical results.


The procedures for extending both the combinatorial and the 
Markov methodologies to more complex structures are generally 
straightforward. One of the major limitations to both approaches, 
however, is already evident in the simple example just considered. 
This limitation stems from the fact that it is rarely satisfactory 
to treat the coverage probability as a constant parameter. And 
since, as already observed, coverage failures are typically the 
dominant source of system failure in highly reliable systems, it 
is particularly important that coverage be accurately modeled. 
Suppose, for example, that in the structure just considered,
the reason coverage failures can occur is that a certain amount
of time, say $\tau$ seconds, is needed to detect that an element has
failed and to take the appropriate action to eliminate its output
from subsequent voter inputs. Should a second failure occur
during that interval, the voter is again presented with two
potentially erroneous inputs and its output is consequently unreliable.
The probability of a coverage failure, then, is the probability
that two element failures occur within a $\tau$-second interval.
Unfortunately, this is not a constant probability.

To handle this case combinatorially, observe that the
probability that the system has failed by time t is equal to the
probability that it has sustained either more than two failures, or
exactly two failures within $\tau$ seconds of each other. Thus,

$$1 - R(t) = \sum_{i=3}^{4} \binom{4}{i} [P(t)]^{4-i}[1-P(t)]^{i} + 4 \cdot 3\, P^{2}(t) \int_{0}^{t} \int_{\eta_1}^{\min[\eta_1+\tau,\, t]} P'(\eta_1)\, P'(\eta_2)\, d\eta_2\, d\eta_1 \qquad (7)$$

If, as assumed earlier, $P(t) = e^{-\lambda t}$, this expression is easily
evaluated, yielding

$$R(t) = \begin{cases}
4P^{3}(t) - 3P^{4}(t), & t < \tau \\
R^*(t) - 6P^{2}(t)\left[(1 - e^{-\lambda\tau}) - P^{2}(t)\,(e^{\lambda\tau} - 1)\right], & t \ge \tau
\end{cases} \qquad (8)$$


with R*(t) as previously defined. The actual coverage probability
(cf. equations 5 and 8) in this case is

$$C = C(t) = \begin{cases}
0, & t < \tau \\
1 - \dfrac{(1 - e^{-\lambda\tau}) - P^{2}(t)\,(e^{\lambda\tau} - 1)}{[1 - P(t)]^{2}}, & t \ge \tau
\end{cases} \qquad (9)$$

and is indeed a function of time.
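
The closed forms in equations 8 and 9 can be cross-checked against a direct simulation of the constant-delay example. The sketch below assumes exponentially distributed element lifetimes with hazard rate `lam` and a fixed detection delay `tau > 0`; the Monte Carlo procedure, seed, and parameter values are illustrative choices made here and are not how CARE III evaluates coverage.

```python
import math
import random


def reliability_delay(lam, tau, t):
    """Equation 8 with P(t) = exp(-lam * t)."""
    p = math.exp(-lam * t)
    if t < tau:
        return 4 * p**3 - 3 * p**4
    perfect = 6 * p**2 - 8 * p**3 + 3 * p**4
    bracket = (1 - math.exp(-lam * tau)) - p**2 * (math.exp(lam * tau) - 1)
    return perfect - 6 * p**2 * bracket


def coverage(lam, tau, t):
    """Equation 9: the effective, time-dependent coverage probability."""
    if t < tau:
        return 0.0
    p = math.exp(-lam * t)
    bracket = (1 - math.exp(-lam * tau)) - p**2 * (math.exp(lam * tau) - 1)
    return 1 - bracket / (1 - p) ** 2


def simulate(lam, tau, t, trials=100_000, seed=1):
    """Fraction of simulated four-element systems still operational at t."""
    rng = random.Random(seed)
    survivors = 0
    for _ in range(trials):
        f = sorted(rng.expovariate(lam) for _ in range(4))
        third_failure = f[2] <= t                       # more than two failures
        uncovered_pair = f[1] <= t and f[1] - f[0] < tau  # second failure during delay
        if not (third_failure or uncovered_pair):
            survivors += 1
    return survivors / trials


if __name__ == "__main__":
    lam, tau, t = 1.0, 0.2, 1.0  # deliberately unrealistic, to keep 1 - R visible
    print("equation 8 :", reliability_delay(lam, tau, t))
    print("simulation :", simulate(lam, tau, t))
    print("C(t)       :", coverage(lam, tau, t))
```

The hazard rate used here is far larger than any realistic value; it is chosen only so that a modest number of trials resolves the difference between the two failure mechanisms.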

The Markov method of modeling redundant structures can also be
extended to include more complex coverage situations by using the
method of stages (Ref. 1).

The state diagram shown in Figure 4a illustrates the principle.

[Figure 4a. Stage Representation of a Constant Delay]

[Figure 4b. Markov Model of a 2-Out-of-4 Structure With Constant Coverage Delay]

The diagram of Figure 4a is characterized by the differential equations

$$\begin{aligned}
P_1'(t) &= -\frac{n}{\tau}\, P_1(t) \\
P_i'(t) &= \frac{n}{\tau}\left[P_{i-1}(t) - P_i(t)\right], \qquad 1 < i \le n
\end{aligned}$$
For large n, then, the series of states shown in Figure 4a provides
a good approximation to a constant $\tau$-second delay. This same series
of states embedded in the Markov model of a 2-out-of-4 structure
(Figure 4b) represents, approximately, the constant coverage delay
model under consideration here.
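
How large n must be can be judged from the distribution of the time needed to traverse the chain of stages. Assuming each of the n stages is left at the constant rate n/τ, the traversal time is Erlang distributed with mean τ and variance τ²/n. The sketch below (an illustration with a hypothetical function name, not CARE III code) evaluates the probability that the chain has not yet been traversed by time t; for a true constant delay this would be 1 for t < τ and 0 afterwards.

```python
import math


def stage_delay_survival(n: int, tau: float, t: float) -> float:
    """Probability that fewer than n stage transitions have occurred by time t.

    The number of completed transitions is Poisson with mean x = n * t / tau;
    each term is computed in log space to avoid underflow for large n.
    """
    if t <= 0:
        return 1.0
    x = n * t / tau
    total = sum(
        math.exp(k * math.log(x) - x - math.lgamma(k + 1)) for k in range(n)
    )
    return min(1.0, total)


if __name__ == "__main__":
    tau = 1.0
    for n in (1, 10, 100, 1000):
        before = stage_delay_survival(n, tau, 0.9 * tau)
        after = stage_delay_survival(n, tau, 1.1 * tau)
        print(f"n = {n:5d}  still delayed at 0.9 tau: {before:.4f}  at 1.1 tau: {after:.4f}")
```

The step sharpens only slowly as n grows, which illustrates why a good approximation may require many pseudo-states.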


This method of stages can be generalized by introducing other
combinations of pseudo-states and selecting appropriate interstage
transition rates. The advantage of this technique is that it
provides an approximate method for handling non-exponentially
distributed holding times without abandoning homogeneous Markov models.
The disadvantage is that good approximations often entail a
substantial increase in the number of required states, a number which
can be enormous for the reliability models of interest here even
without the addition of pseudo-states.

It is possible to avoid adding pseudo-states and still retain
some advantages of the Markov method by generalizing the notion of
a Markov process. Consider the state diagram shown in Figure 5.

[Figure 5. Semi-Markov Model of a 2-Out-of-4 Structure With Imperfect Coverage]

This diagram is similar to that of Figure 4b except that the n
pseudo-states in the latter diagram have been collapsed into a
single state here. The cost of doing this is to introduce a
transition rate* $\delta(\eta)/d(\eta)$ which is now a function of the time $\eta$ from
the entry into state A. If $\eta$ were a measure of the time from
entry into the initial state of the model, the model would describe
an inhomogeneous Markov process. As it is, however, the process
is not even Markov; the probability of a transition from state A
to state 1 is a function not only of the two states but of the time
spent in state A as well. Such processes are called semi-Markov
(Ref. 2).

Semi-Markov processes, while less analytically tractable than
Markov processes, can nevertheless be represented in terms of linear
integral equations and the state-occupation probabilities can often
be obtained without undue difficulty. The state-occupation
probabilities $P_i(t)$ of the process of Figure 5, in particular, satisfy


*The function $\delta(\eta)$ here represents the probability density of a
transition from state A to state 1 exactly $\eta$ time units after a
transition into state A, under the condition that no other
transitions were possible, and $d(\eta)$ is the probability that no such
transition has yet taken place by time $\eta$. Thus, the rate of
such transitions, under the condition just described, is given
by the ratio $\delta(\eta)/d(\eta)$.
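the following set of equations:

$$\begin{aligned}
P_0(t) &= e^{-4\lambda t} \\
P_A(t) &= 4\lambda \int_0^t e^{-\lambda(t-\eta)}\, d(\eta)\, d\eta \;\, e^{-3\lambda t} \\
P_1(t) &= 4 \int_0^t \left(1 - e^{-\lambda(t-\eta)}\right) \delta(\eta)\, d\eta \;\, e^{-3\lambda t} \\
P_2(t) &= 3\lambda \int_0^t P_1(\eta)\, e^{-2\lambda(t-\eta)}\, d\eta
\end{aligned} \qquad (10)$$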

(The integrand of $P_A(t)$, for example, is just the product of the
probability density of a failure at time $t-\eta$, the probability $d(\eta)$
that a transition from state A to state 1 has not taken place in
the intervening time $\eta$, and the probability $e^{-3\lambda t}$ that no other
failure has occurred by time t. Entirely similar arguments can
be used to establish the other equations.) In the present example,
$\delta(\eta) = \delta_D(\eta - \tau)$ with $\delta_D(t)$ the Dirac delta function and $\tau$ the (fixed)
coverage delay. Consequently,


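$$d(\eta) = 1 - \int_0^{\eta} \delta(\eta')\, d\eta' = \begin{cases} 1, & \eta < \tau \\ 0, & \eta \ge \tau \end{cases}$$

and

$$\begin{aligned}
P_A(t) &= \begin{cases} 4e^{-3\lambda t}\left(1 - e^{-\lambda t}\right), & t < \tau \\ 4e^{-3\lambda t}\left(e^{-\lambda(t-\tau)} - e^{-\lambda t}\right), & t \ge \tau \end{cases} \\
P_1(t) &= \begin{cases} 0, & t < \tau \\ 4e^{-3\lambda t}\left(1 - e^{-\lambda(t-\tau)}\right), & t \ge \tau \end{cases} \\
P_2(t) &= \begin{cases} 0, & t < \tau \\ 6e^{-\lambda(2t+\tau)}\left(1 - e^{-\lambda(t-\tau)}\right)^{2}, & t \ge \tau \end{cases}
\end{aligned} \qquad (11)$$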

Since

$$R(t) = P_0(t) + P_A(t) + P_1(t) + P_2(t)$$

this analysis yields results identical to the previous combinatorial
analysis of the same example (cf. equations 8 and 11).
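
The claimed identity between the semi-Markov and combinatorial results can be verified numerically. The sketch below codes the state-occupation probabilities of equation 11 (with $P_0(t) = e^{-4\lambda t}$ from equation 10) and compares their sum with equation 8. Function names and parameter values are illustrative only.

```python
import math


def occupation_probabilities(lam, tau, t):
    """(P0, PA, P1, P2) for a fixed coverage delay tau, per equations 10 and 11."""
    e = math.exp
    p0 = e(-4 * lam * t)
    if t < tau:
        pa = 4 * e(-3 * lam * t) * (1 - e(-lam * t))
        p1 = 0.0
        p2 = 0.0
    else:
        pa = 4 * e(-3 * lam * t) * (e(-lam * (t - tau)) - e(-lam * t))
        p1 = 4 * e(-3 * lam * t) * (1 - e(-lam * (t - tau)))
        p2 = 6 * e(-lam * (2 * t + tau)) * (1 - e(-lam * (t - tau))) ** 2
    return p0, pa, p1, p2


def reliability_equation_8(lam, tau, t):
    """Combinatorial result for the same constant-delay example."""
    p = math.exp(-lam * t)
    if t < tau:
        return 4 * p**3 - 3 * p**4
    perfect = 6 * p**2 - 8 * p**3 + 3 * p**4
    bracket = (1 - math.exp(-lam * tau)) - p**2 * (math.exp(lam * tau) - 1)
    return perfect - 6 * p**2 * bracket


if __name__ == "__main__":
    lam, tau = 1.0, 0.2  # illustrative values
    for t in (0.1, 0.2, 1.0, 3.0):
        total = sum(occupation_probabilities(lam, tau, t))
        print(t, total, reliability_equation_8(lam, tau, t))
```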

As noted earlier, an overwhelming disadvantage of the Markov
method of modeling and analyzing the reliability of redundant
structures under the conditions of interest here (with the consequent
heavy emphasis on coverage) is the extremely large number of states
needed to describe the system. This, of course, is only exacerbated
if the method of stages is used to approximate non-exponential
holding time distributions, but it remains a decisive limitation even if
semi-Markov modeling techniques are used.

To gauge the magnitude of the problem, consider a system
consisting of n stages.* If the i-th of these stages can sustain as many
as $m_i$ faults and still be operational, and if the number of
distinguishable states (e.g., active, benign, detected, etc.) that can be
occupied by a stage i fault is $\ell_i$, then the number of possible
operational system states is

$$N = \prod_{i=1}^{n} \binom{m_i + \ell_i}{m_i} \qquad (12)$$

This number can be large even for relatively small parameters
$\ell_i$, $m_i$, and n. When n = 4 and $\ell_i = 6$, $m_i = 2$ for all i, for example,
N = 614,656. Since CARE III actually allows n to be as large as 70
and places no restrictions on $m_i$, it is clear that conventional
Markov-like techniques are not appropriate to the problem at hand.
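
The growth of the state count is easy to tabulate. The sketch below assumes that each stage contributes the number of ways of placing at most $m_i$ indistinguishable faults into $\ell_i$ fault states, i.e. the binomial coefficient C($m_i + \ell_i$, $m_i$). That per-stage count is an assumption made here; it is adopted because it reproduces both figures quoted in the text (614,656 and 81).

```python
from math import comb, prod


def operational_state_count(max_faults, fault_states):
    """Product over stages of C(m_i + l_i, m_i) (assumed per-stage count)."""
    return prod(comb(m + l, m) for m, l in zip(max_faults, fault_states))


if __name__ == "__main__":
    print(operational_state_count([2] * 4, [6] * 4))    # four stages, six fault states each
    print(operational_state_count([2] * 4, [1] * 4))    # same system, one fault state each
    print(operational_state_count([2] * 70, [6] * 70))  # 70 stages
```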


*In CARE III terminology, the term "stage" refers to an ensemble 
of identical, interchangeable units. This term should not be 
confused with the "method of stages" described earlier. 

Unfortunately, the combinatorial approach to reliability
analysis suffers from a similar computational explosion. A
combinatorial analysis, in effect, entails an itemization of the
(mutually exclusive) sequences of events that can lead to a
failure and then a determination of the probability of each of
these event sequences. Thus, the emphasis is on the paths
connecting the various possible system states rather than on the
states themselves. Obviously, however, the number of such paths
increases at least as rapidly as the number of states they
interconnect, so a purely combinatorial approach to problems of the
complexity of those of concern here does not appear to be very
attractive either.
3. The CARE III Approach


The motivation for the CARE III approach to reliability
analysis is evident from an examination of equation 12. It will
be noted, in particular, that the magnitude of N in equation 12
is a very rapidly increasing function of the parameters $\ell_i$. (If
all $\ell_i$ were equal to 1 rather than the 6 selected in the earlier
example, N would be reduced from 614,656 to 81.) The reason
these parameters must, in general, be greater than unity is
that the coverage associated with a failure depends upon the
states of other failed elements in the system. That is, the
probability that the system recovers from a failure in element A may
well depend upon whether or not element B has previously failed,
whether its failure has been detected, whether an erroneous
output has been produced as a result of that failure, and whether
element B is in a failed-active state (capable of producing
erroneous outputs) or in a failed-benign state (incapable, at least
temporarily, of producing further errors).

The key to reducing $\ell_i$ without decreasing the ability to
include all relevant coverage factors in the reliability model is
suggested by the previous analysis of the 2-out-of-4 structure.
Figure 3 shows a Markov model of that structure with the entire
effect of coverage reflected in the state-transition rates. While
the coverage probability is shown as a constant in Figure 3, it
was demonstrated that the effect of more complex coverage
situations could be handled by allowing this probability to be a
suitably defined function of time (cf. equation 9).

The CARE III method, then, is to represent the structure of
interest as an inhomogeneous Markov model, with the different
states distinguished only by the numbers of faults in each of the
various stages comprising the system. The state-transition rates
are separately determined using a coverage model to account for
fault-state effects. Although combinatorial techniques could have
been used (as they were, for example, to derive the results of
equation 9), the coverage model found to be most appropriate for
CARE III is one based on semi-Markov techniques similar to those
used in analyzing the model of Figure 5.

The potential advantage of this approach is apparent. The
number of states that have to be accounted for in the reliability
model is reduced from that given in equation 12 to a more
manageable number:

$$N' = \prod_{i=1}^{n} (m_i + 1) \qquad \text{(cf. equation 12)}$$

The cost of doing this, of course, is 1) to force the reliability 
model to be inhomogeneous*, and 2) to necessitate a separate analy- 
sis to determine the needed coverage parameters. For reliability 
assessment problems of the complexity of concern here, however, the 
advantages of this approach, in terms of computational effort, far 
outweigh its disadvantages. In effect, the model has been reduced 
from one having N = n^ ^ states to one having n^ + n 2 + 

... + n^ states, with n^ denoting the number of relevant states 
given that i faults have already taken place. (The reduction is 
in fact more dramatic than this since much of the computational 
effort needed to determine the transition functions given i faults 
can also be used to determine these functions given j ^ i. faults.) 

In order to realize the full advantage of this reliability and 
coverage model separation, however, it is necessary to introduce 


*This increased flexibility does have ancillary advantages, however: 
the hazard rates associated with the various system elements are no 
longer restricted to be time-independent. There are situations in 
which this added degree of freedom is needed to reflect accurately 
the physical events actually being modeled. 
some approximations having to do with the probability of occurrence
of certain joint events. If A and B represent two events and the
probability of an event E is denoted P(E), then, as is well known,
the probability that either A or B occurs is

$$P(A+B) = P(A) + P(B) - P(A \cdot B)$$

with $P(A \cdot B)$ the probability that A and B both take place. Now
suppose both A and B represent compound events; that is, A is said to
have occurred only if the events $A_1, A_2, \ldots, A_n$ have all occurred,
and similarly for B. Suppose further that at least one of the B
events, say $B_1$, is independent of all events in the set
$\{A_1, A_2, \ldots, A_n\}$. Then

$$P(A \cdot B) = P(B \mid A)\,P(A) \le P(B_1)\,P(A)$$

and

$$P(A)\,[1 - P(B_1)] + P(B) \le P(A+B) \le P(A) + P(B)$$

It follows that: 1) P(A+B) is always overbounded by the sum
of the probabilities of the two individual events A and B. 2) If
either of the two events depends on the occurrence of some subevent
that is not part of the other, and if the probability of this
subevent is small, the error introduced by approximating P(A+B) by
P(A) + P(B) is also small. Specifically,

$$P(A) + P(B) - P(A+B) \le P(A)\,P(B_1)$$
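
The bound can be illustrated by exhaustive enumeration of a small example. The sketch below uses a hypothetical construction chosen for this illustration: three mutually independent subevents, with A requiring the first two and B requiring the first and the third, so that the third subevent plays the role of the independent subevent of B. The probabilities are arbitrary small numbers.

```python
from itertools import product


def joint_event_probabilities(p_a1, p_a2, p_b1):
    """Return (P(A), P(B), P(A or B)) for A = A1 and A2, B = A1 and B1."""
    p_a = p_b = p_either = 0.0
    for a1, a2, b1 in product((False, True), repeat=3):
        weight = 1.0
        for happened, p in ((a1, p_a1), (a2, p_a2), (b1, p_b1)):
            weight *= p if happened else 1.0 - p
        a = a1 and a2
        b = a1 and b1
        if a:
            p_a += weight
        if b:
            p_b += weight
        if a or b:
            p_either += weight
    return p_a, p_b, p_either


if __name__ == "__main__":
    p_b1 = 1e-3
    p_a, p_b, p_either = joint_event_probabilities(1e-2, 1e-2, p_b1)
    print("error of sum approximation:", p_a + p_b - p_either)
    print("bound P(A) * P(B1)        :", p_a * p_b1)
```

In this construction the bound is attained, which shows that it cannot be tightened in general.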

In the present instance, the events of concern are those that
lead to system failure. The probability of any one of these events
is therefore not greater than the probability $P_F(t)$ of system failure,
a probability that is already small, for all t of interest, for the
highly reliable systems for which CARE III was designed. Thus, if
two events A and B both lead to system failure, if one of these
events depends upon a subevent not common to the other, and if
the probability of this subevent is also of the order of $P_F(t)$
or less, the error introduced by approximating the probability of
either event by the sum of their individual probabilities is of
the order of $P_F^2(t)$. Since $P_F(t)$ is almost always less than $10^{-4}$
for cases of interest here and is typically of the order of $10^{-8}$
or less (if this were not true, reliability models much simpler
than CARE III would suffice), the error introduced by such
approximations is truly negligible. Moreover, even if this were not true,
such approximations overbound the probability of a system failure
and hence provide a conservative reliability estimate in any case.
Details as to exactly how these approximations are introduced will
become apparent in the ensuing discussion.


(i) The CARE III Reliability Model 


Let $P_{j|i}(t|\tau)$ denote the conditional probability that a system
is in state j at time t given that it was in state i at time $\tau$.
Similarly, let $P_{\ell|j,i}(t|\eta,\tau)$ denote the conditional probability that
a system is in state $\ell$ at time t given that it was in state j at
time $\eta$ and in state i at time $\tau$. Then, clearly, for any $\tau < \eta < t$,

$$P_{\ell|i}(t|\tau) = \sum_{j} P_{j|i}(\eta|\tau)\, P_{\ell|j,i}(t|\eta,\tau) \qquad (13)$$

with the sum taken over all the (assumed finite number of) possible
intermediate states j. (If, for all $\tau < \eta < t$, $P_{\ell|j,i}(t|\eta,\tau) = P_{\ell|j}(t|\eta)$,
then equation 13 reduces to the Chapman-Kolmogorov equation for
continuous-time, discrete-state systems.)

It follows from equation 13 that

$$P_{\ell|i}(t+\Delta t|\tau) = P_{\ell|i}(t|\tau)\, P_{\ell|\ell,i}(t+\Delta t|t,\tau) + \sum_{j \ne \ell} P_{j|i}(t|\tau)\, P_{\ell|j,i}(t+\Delta t|t,\tau) \qquad (14)$$
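Let

$$\lambda_{\ell|i}(t|\tau) = \lim_{\Delta t \to 0} \frac{1 - P_{\ell|\ell,i}(t+\Delta t|t,\tau)}{\Delta t}$$

and

$$\lambda_{j\ell|i}(t|\tau) = \lim_{\Delta t \to 0} \frac{P_{\ell|j,i}(t+\Delta t|t,\tau)}{\Delta t}$$

Then, rearranging terms in equation 14, dividing by $\Delta t$ and taking
the limit as $\Delta t \to 0$ yields

$$\frac{\partial P_{\ell|i}(t|\tau)}{\partial t} = -P_{\ell|i}(t|\tau)\,\lambda_{\ell|i}(t|\tau) + \sum_{j \ne \ell} P_{j|i}(t|\tau)\,\lambda_{j\ell|i}(t|\tau) \qquad (15)$$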


This set of equations is a form of the Kolmogorov forward
equations. It differs from the more conventional form in that
the transition parameters $\lambda_{j\ell|i}(t|\tau)$ are also functions of the
initial state i of the system at time $\tau$. If the notation
indicating the condition that the system be in state i at time $\tau$ is
suppressed, equation 15 can be expressed in the more convenient
form

$$\frac{dP_\ell(t)}{dt} = -P_\ell(t)\,\lambda_\ell(t) + \sum_{j \ne \ell} P_j(t)\,\lambda_{j\ell}(t) \qquad (16)$$

It must be remembered in the ensuing discussion, however, that
the transition parameters may also be functions of the initial
conditions.
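
Forward equations of this form, with time-dependent transition parameters, are straightforward to integrate numerically. The sketch below is a generic illustration, not the CARE III algorithm: it assumes that the total exit rate of each state equals the sum of its listed outgoing rates, uses a fixed-step Runge-Kutta scheme, and is exercised on a hypothetical two-state chain whose single hazard rate grows linearly with time.

```python
import math


def forward_step(p, rates, t, h):
    """One fourth-order Runge-Kutta step of the forward equations.

    rates(t) returns a dict {(src, dst): rate} of transition rates at time t.
    """
    def rhs(state, time):
        deriv = [0.0] * len(state)
        for (src, dst), rate in rates(time).items():
            flow = state[src] * rate
            deriv[src] -= flow  # outflow from the source state
            deriv[dst] += flow  # inflow into the destination state
        return deriv

    k1 = rhs(p, t)
    k2 = rhs([x + 0.5 * h * k for x, k in zip(p, k1)], t + 0.5 * h)
    k3 = rhs([x + 0.5 * h * k for x, k in zip(p, k2)], t + 0.5 * h)
    k4 = rhs([x + h * k for x, k in zip(p, k3)], t + h)
    return [
        x + h / 6.0 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(p, k1, k2, k3, k4)
    ]


def solve(p0, rates, t_end, steps=2000):
    """Integrate from time 0 to t_end starting from the probability vector p0."""
    h = t_end / steps
    p = list(p0)
    for n in range(steps):
        p = forward_step(p, rates, n * h, h)
    return p


def example_rates(t, a=0.5):
    """Hypothetical two-state chain: state 0 -> state 1 at hazard rate a * t."""
    return {(0, 1): a * t}


if __name__ == "__main__":
    t_end = 2.0
    p = solve([1.0, 0.0], example_rates, t_end)
    print("numerical P0:", p[0])
    print("exact     P0:", math.exp(-0.5 * t_end**2 / 2))
```

For this example the exact solution is $P_0(t) = e^{-a t^2/2}$, which gives a simple check on the integrator.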


In the CARE III context, it is necessary to distinguish states 
both in terms of the number of faults that have been sustained 
in each stage of the system but also, of course, with regard to 
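$$\frac{dP_\ell(t)}{dt} = -P_\ell(t)\,\lambda_\ell(t) + \sum_{j \ne \ell} P_j(t)\,\lambda^{(1)}_{j\ell}(t) \qquad (17a)$$

$$\frac{dQ_\ell(t)}{dt} = P_\ell(t)\,\mu_\ell(t) + \sum_{j \ne \ell} P_j(t)\,\lambda^{(2)}_{j\ell}(t) \qquad (17b)$$

with

$$\lambda_\ell(t) = \mu_\ell(t) + \sum_{j \ne \ell} \left[\lambda^{(1)}_{\ell j}(t) + \lambda^{(2)}_{\ell j}(t)\right]$$

The term $\mu_\ell(t)$ here represents the rate of occurrence, in a
system which is still operational after $\ell$ failures, of events that
cause the system to fail even though no new faults have taken place.*
The terms $\lambda^{(1)}_{j\ell}(t)$ and $\lambda^{(2)}_{j\ell}(t)$ represent the rates of occurrence of
faults that take the system from operational state j to,
respectively, operational state $\ell$ and failed state $\ell$. Since, as has been
repeatedly observed in this discussion, the systems of concern here
are highly reliable, $\lambda^{(1)}_{j\ell}(t)$ must in general be much larger than
$\lambda^{(2)}_{j\ell}(t)$ and $\lambda_\ell(t)$ must be large compared to $\mu_\ell(t)$. Thus, to a good
approximation, equation 17a can be rewritten in the form

$$\frac{dP_\ell(t)}{dt} = -P_\ell(t)\,\lambda^*_\ell(t) + \sum_{j \ne \ell} P_j(t)\,\lambda^*_{j\ell}(t) \qquad (18a)$$

with $\lambda^*_{j\ell}(t) = \lambda^{(1)}_{j\ell}(t) + \lambda^{(2)}_{j\ell}(t)$ and $\lambda^*_\ell(t) = \sum_{j} \lambda^*_{\ell j}(t)$.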


*Such events can be caused, for example, by latent faults becoming 
active or producing erroneous outputs; this will be elaborated upon 
shortly . 
If the solutions to these equations are denoted by $P^*_\ell(t)$, equation
17b assumes the approximate form

$$\frac{dQ_\ell(t)}{dt} = P^*_\ell(t)\,\mu_\ell(t) + \sum_{j \ne \ell} P^*_j(t)\,\lambda^{(2)}_{j\ell}(t) \qquad (18b)$$

Although the differential equations (17) could be solved
directly, the errors introduced in replacing $P_\ell(t)$ by
$P^*_\ell(t)$ are indeed negligible for all cases of interest. It will
be observed, in fact, that $P^*_\ell(t)$ is just the probability that
the system would be operating with $\ell$ failures were the coverage
perfect. Thus, replacing $P_\ell(t)$ in equation 17b by $P^*_\ell(t)$ is
equivalent to allowing systems that have already suffered from a
coverage failure to be counted among those still susceptible to
coverage failures. This is, in turn, equivalent to replacing
P(A+B) with P(A) + P(B), with A and B both representing highly
unlikely coverage failure events. As noted earlier, such
approximations introduce an error of the order of $p^2$, with p the, in
this case, very small probability of either of these events by
itself. The advantage of introducing this approximation is that
the probabilities $P^*_\ell(t)$ can be readily evaluated using
straightforward combinatorial techniques, thereby avoiding the need for
the more time-consuming, and negligibly more accurate, calculation
of the probabilities $P_\ell(t)$ as defined by equation 17a.

(ii) The Coverage Model 

The purpose of the CARE III coverage model is to determine
the transition rates $\mu_\ell(t)$ and $\lambda^{(2)}_{j\ell}(t)$ needed to calculate the
failed state probabilities $Q_\ell(t)$ as defined by the set of equations
18b. CARE III recognizes three basic causes of coverage failure:
1) An existing latent fault causes the system to take some
unacceptable action (an error is propagated). 2) A new fault occurs which,
in combination with an existing latent fault, prevents the system
from functioning properly. 3) A pair of existing latent faults
for the first time reaches a system-disabling state. The transition
rates associated with the first and third of these events are
collectively represented by the term $\mu_\ell(t)$ in the equations 18b; the
rate of occurrence of the second type of event is represented by
the term $\lambda^{(2)}_{j\ell}(t)$. A fault is said to be latent from the time it
first occurs until it is either detected and isolated from the
system or, in the case of a transient fault, reaches a benign state.
The function of the coverage model is to represent the behavior of
each fault during its latency period.

Note that the second and third causes of coverage failure both
depend on the existence of a pair of latent faults. It often
happens that a fault, while entirely benign itself, can become lethal
in combination with some other fault. (A triple-modular redundant
configuration consisting of three identical elements feeding a
majority voter is an obvious example of this. If any one element
malfunctions, its output is ignored by the voter. If a second
element fails before the first failure is detected, however, the
combination of the two could well produce an erroneous output.)
In many reliability analyses, such second-order effects are
negligible compared to other causes of failure and consequently are
simply ignored. In the highly reliable systems for which CARE III
was designed, however, such effects are frequently the dominant
cause of system failure.

Obviously, not all pairs of latent faults pose a threat to
the system. Faulty modules providing inputs to two independent
voters, for example, should create no difficulty even if both are
simultaneously in the active, error-producing, state. It is
therefore necessary for the user to specify all critical pairs of
faults; i.e., to specify those pairs of modules which could cause
the system to fail should the second module malfunction before
the first one has been identified as faulty. (This critical-pair
specification is easily accomplished using the same input routine
used to specify the overall system configuration; see below.)

The coverage model thus actually consists of two coverage
models: a single-fault model to trace the various states of a
single fault, and a double-fault model to track fault pairs. The
single-fault model is shown in Figure 7. When a fault first occurs,
it is said to be in the active state (state A in Figure 7). If the
fault is transient or intermittent, it may jump from the active to
the benign state (state B). These transitions take place at a
constant rate $\alpha$; for permanent, non-intermittent faults, of course,
$\alpha = 0$. If the fault is intermittent, the reverse, benign-to-active,
transition takes place at some constant rate $\beta$; for transient faults,
$\alpha \ne 0$ and $\beta = 0$. In the benign state, the fault is incapable of
causing any discernible malfunction. Thus, it can neither be detected
nor produce erroneous output. In the active state, however,
the fault is both detectable and capable of producing incorrect
output. The rate at which either of these events takes place depends
upon the operating environment and, in particular, on how frequently
the faulty element is exercised in a way that causes
the defect to manifest itself. Once an erroneous output is produced,
the system is said to be in the active-error state ($A_E$). Again, if
the fault is either intermittent or transient, it may jump to the
benign state, although now the error is still present so the state
is designated the benign-error state (state $B_E$; the reason for
distinguishing between states $A_E$ and $B_E$ will shortly become apparent).
When the faulty element is in either of the two error states, the 

error propagates at some rate ^ measured from the time of entry 

into that state, to some point in the system at which it is either 




t = time from entry into 
active state 

T = time from entry into 
error state 
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detected (e.g., through a decoder or a voter) or else escapes 
undetected and results in a system failure (state F) . The proba- 
bilities of these two alternatives is C and 1-C, respectively. 

If the fault is detected, either through testing or through the
detection of an erroneous output, the faulty element enters the
active-detected state or the benign-detected state, depending
on the state of the fault when it was detected. At that time
a decision is made as to whether the faulty element is to be
retired from the system or whether it can continue to be used.

This latter decision might be made, for example, if the fault
recovery procedure included a diagnostic routine designed to
distinguish between permanent and transient faults. If the fault is
detected in the active state, the decision is made with probability
P_A that the element must be retired from service; if it is detected
in the benign state, the same decision is made with probability P_B.
Thus, with probabilities 1-P_A and 1-P_B, respectively, the faulty
element is returned to service following the detection of the fault.
(The dashed lines in Figure 7 indicate that the transition takes
place immediately with the probability indicated.)

Note that as long as the option is available to diagnose a
detected fault as transient, it is possible that this decision is
made erroneously. Thus P_B and even P_A may be less than unity even
when the fault is in fact permanent or intermittent. Similarly,
P_B and especially P_A may be greater than zero when the fault is
indeed transient. The model assumes that the effect of a decision that
the fault is transient is to eliminate the error, if an error had
already been produced, and to return the faulty element to the
error-free active or benign state, depending on its state when the fault
was detected. If the fault was transient and detected in the benign
state, it either remains in the benign-detected state or returns to
the error-free benign state. In either case, since β = 0, it can
never again become active, so it ceases to pose any further threat
to the system. If the fault is transient and detected in the
active state, or if it is permanent or intermittent and detected
in either state, and if it is diagnosed as transient, it remains
latent and may have another chance to cause the system to fail.


Even more detailed single-fault models could, of course, be
defined. Non-constant active-to-benign and benign-to-active
transition rates could be allowed, for example, and distinctions could
be made between single and multiple errors. Moreover, such models
could easily be incorporated into the CARE III structure. The
model selected, however, was felt to be an effective compromise
between the desire to allow the user as much flexibility as possible
in defining the behavior of a faulty element, and the need to
keep the model from becoming so baroque that the user despairs of
ever defining all of the parameters. At present, the fault
detection rate δ(t)/d(t), the fault generation rate ρ(t)/r(t), and the
error propagation rate ε(t)/e(t) are all restricted to assume the
form

$$\frac{\phi(t)}{1 - \int_0^t \phi(\eta)\, d\eta}$$

with either

$$\phi(t) = \phi\, e^{-\phi t}, \qquad 0 < t$$

or

$$\phi(t) = \begin{cases} \phi, & 0 < t < 1/\phi \\ 0, & \text{otherwise} \end{cases}$$

That is, either the transition rates or the transition density
functions are assumed to be constant over some range; the functional
form and, of course, the constant φ can be independently selected by the
user for each of the three transition rates. In addition, the user
can define up to five fault types, each with its own set of
specifiers (α, β, δ(t), ρ(t), ε(t), C, P_A, P_B), and designate
that any or all of these types can afflict each of the system
stages, with arbitrary rates of occurrence for each type at each
stage.
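The two admissible forms can be illustrated numerically. The Python
sketch below is an illustration only, not CARE III code; the function
names and the sample value of φ are assumptions made here. It evaluates
the rate φ(t)/[1 - ∫φ(η)dη] for the exponential density, where the rate
is the constant φ, and for the uniform density, where the density is
constant and the rate grows as φ/(1 - φt).

```python
import math

def exponential_density(phi, t):
    """Density phi * exp(-phi * t) for t > 0 (gives a constant transition rate)."""
    return phi * math.exp(-phi * t) if t > 0 else 0.0

def exponential_survivor(phi, t):
    """1 minus the integral of the exponential density from 0 to t."""
    return math.exp(-phi * t) if t > 0 else 1.0

def uniform_density(phi, t):
    """Density phi on 0 < t < 1/phi and zero otherwise (a constant density)."""
    return phi if 0 < t < 1.0 / phi else 0.0

def uniform_survivor(phi, t):
    """1 minus the integral of the uniform density from 0 to t."""
    if t <= 0:
        return 1.0
    return max(0.0, 1.0 - phi * t)

def transition_rate(density, survivor, phi, t):
    """Transition rate: density divided by the probability of no transition yet."""
    remaining = survivor(phi, t)
    if remaining <= 0.0:
        raise ValueError("transition has already occurred with certainty")
    return density(phi, t) / remaining

if __name__ == "__main__":
    phi = 2.0  # hypothetical parameter value, chosen only for illustration
    for t in (0.1, 0.25, 0.4):
        print(t,
              transition_rate(exponential_density, exponential_survivor, phi, t),
              transition_rate(uniform_density, uniform_survivor, phi, t))
```

With the exponential form the printed rate stays at φ for every t; with
the uniform form it rises without bound as t approaches 1/φ, which is
the sense in which only one of "rate" and "density" is constant in each
case.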

It might be supposed that the double faults could be modeled
by simply combining two single-fault models and then determining
if, and when, the two independent fault states form some lethal
combination. The problem with this approach is that the two fault
states may independently form a lethal combination repeatedly, so
that the same system failure is counted more than once. (Since a second
entry into a state is not necessarily a small-probability event
given that the first entrance took place, the previously-used
argument, that the probability of both events is of the order of
the square of the probability of either of them, is not applicable
here.) It is therefore necessary to introduce a separate
double-fault model. The model selected is shown in Figure 8. This model
is applicable if a second fault occurs when the first fault is in
the benign (error-free) state. (If this is not the case, the
combination of the two faults is treated as lethal upon the occurrence
of the second fault; see below.) Thus, the occurrence of the second
fault places the fault-pair in the A_2B_1 state (first fault benign,
second fault active). From there, the fault-pair can go to the
B_1B_2 state (both faults benign) if the second fault becomes benign
before the first fault becomes active, to the detected state D if
the active fault is detected and diagnosed as permanent, or to the
failed state F if the first fault becomes active with the second
fault still also in the active state or if the second fault causes
an error to be produced. Since both faults are benign in the B_1B_2
state, the only possible transitions from that state are back to
the A_2B_1 state or to the A_1B_2 state (first fault active, second
fault benign) with its entirely analogous transitions.

[Figure 8: CARE III Double Fault Model]
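The transitions just described can be sketched as a small absorbing
Markov chain. The Python below is a minimal sketch under a simplifying
assumption that is not made in the text: every rate is taken to be
constant, so that the probability of ending in state F can be obtained
from the embedded jump chain. The function and argument names are
hypothetical; `detect1` and `detect2` stand for the rate at which the
corresponding active fault is detected and diagnosed as permanent.

```python
def double_fault_failure_probability(alpha1, beta1, detect1, rho1,
                                     alpha2, beta2, detect2, rho2):
    """Probability of eventually reaching F, starting in state A2B1.

    alphaK : active-to-benign rate of fault K
    betaK  : benign-to-active rate of fault K
    detectK: rate of detection-and-retirement of fault K while active
    rhoK   : rate at which active fault K produces an error
    All rates are assumed constant (an assumption of this sketch).
    """
    # Total exit rates from the two states with exactly one active fault.
    out_a2b1 = alpha2 + detect2 + beta1 + rho2
    out_a1b2 = alpha1 + detect1 + beta2 + rho1
    if out_a2b1 <= 0 or out_a1b2 <= 0:
        raise ValueError("each single-active state needs a positive exit rate")

    # Jump probabilities out of A2B1 (second fault active, first benign).
    to_bb_from_a2b1 = alpha2 / out_a2b1          # second fault goes benign
    fail_from_a2b1 = (beta1 + rho2) / out_a2b1   # first goes active, or error

    # Analogous jump probabilities out of A1B2.
    to_bb_from_a1b2 = alpha1 / out_a1b2
    fail_from_a1b2 = (beta2 + rho1) / out_a1b2

    out_bb = beta1 + beta2
    if out_bb == 0:
        # Both faults transient: B1B2 can never be left, so no failure from it.
        fail_from_bb = 0.0
    else:
        w_a2b1 = beta2 / out_bb  # B1B2 -> A2B1
        w_a1b2 = beta1 / out_bb  # B1B2 -> A1B2
        denom = 1.0 - w_a2b1 * to_bb_from_a2b1 - w_a1b2 * to_bb_from_a1b2
        if denom <= 0:
            raise ValueError("chain never reaches D or F with these rates")
        fail_from_bb = (w_a2b1 * fail_from_a2b1
                        + w_a1b2 * fail_from_a1b2) / denom

    return fail_from_a2b1 + to_bb_from_a2b1 * fail_from_bb
```

For example, with all eight rates set to 1 the pair leaves each
single-active state toward F twice as often as toward D, and the
function returns 2/3.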

It will be noted that the double-fault model is conservative,
relative to the single-fault model, in its definition of a failed
state. If both faults are ever simultaneously active, the system
fails regardless of whether or not either fault has resulted in an
error. Moreover, a system failure results if either fault produces
an error even though that error could potentially be detected
before it causes any system damage. Obviously, a more elaborate
model could have been postulated, one containing additional states
to distinguish, among other things, the various possible error
conditions. As in the case of the single-fault model, however, a
compromise is required between the need to model accurately the
important contributors to coverage failures and the desire not to
overburden the user with overly-fine distinctions. If both faults in
a critical pair are active, for example, and one of them produces
an error, the probability that that error is detected before it
causes system damage is presumably altered, possibly significantly,
by the presence of the second fault. Similarly, the coverage
parameters may well be affected if both faults produce errors before
either error propagates. A more elaborate double-fault model would
force the user to examine these issues for every critical-fault
pair.

The compromise represented by the double-fault model seems to
be a reasonable one for two reasons: 1) The most significant event
in determining the probability of a lethal double-fault is the
existence of the latent first fault at the time of the second. The
probability of this event, however, is determined using the
single-fault model and hence does not depend on the details of the
double-fault model. 2) The conservatism of the double-fault model causes
the probability of a double-fault coverage failure to be overbounded.
Thus, the double-fault model is consistent with the other CARE III
approximations in that it results in a tight overbound on the
system unreliability.


The single- and double-fault coverage models are used by the
CARE III reliability model as follows: Let p_f(t-τ | ℓ, t) be the
probability density of a specific type of element failure at time
t-τ, given that ℓ failures have occurred by time t. Then, if
p_SF(τ, f) is the probability density of system failures due to the
single fault f, τ time units after its occurrence, the rate of
occurrence of system failures at time t due to this event is just

$$\lambda^{(1)}_{\ell}(t, f) = \int_0^t p_{SF}(\tau, f)\, p_f(t-\tau \mid \ell, t)\, d\tau \qquad (19)$$

Similarly, if P_A(τ, f) and P_B(τ, f) are the probabilities that the
fault f is in the active and benign states, respectively, τ time
units after its occurrence and if p_DF(τ, f_1, f_2) is the probability
density of system failures due to the critical-fault-pair f_1, f_2
τ time units after the occurrence of the second fault, the rate of
system failures at time t due to the first of a critical pair of
faults being active when the second takes place is

$$\lambda^{(2)}_{j\ell}(t, f_1, f_2) = p_{f_2}(t) \int_0^t P_A(\tau, f_1)\, p_{f_1}(t-\tau \mid \ell, t)\, d\tau \qquad (20)$$

with j representing the number of element failures before the fault
f_1 and ℓ the number after f_2. (Recall that, in general, j and ℓ
are vectors whose components indicate the number of failures in
each stage.)

The rate of system failures at time t due to a critical-fault-
pair subsequent to the second fault is
( 2 )


I 




s 


Q ^ PB(T^|fl)Pf^(t-T^-T2U,t)dT^dT2 


( 21 ) 


The transition rates indicated in Figure 6 are thus 


'O (1) (2) 

= 2. + 2. 

all f all critical 

pairs 


(2) 

ij,<« 


2 


(2) 


all critical 
pairs f^.^2 


( 22 ) 


Note that the function p_f(t-τ | ℓ, t) is conditioned on the
event that the system has suffered exactly ℓ element failures by
time t. Actually, the function of interest is subject to the
additional condition that the system has also not failed by time
t, since the transitions of concern are those taking the system
from an operating state to a failed state. Without this added
condition, the function p_f(t-τ | ℓ, t) is easily evaluated; with it,
the evaluation is considerably more difficult. Ignoring this
condition, however, is entirely equivalent to replacing P_ℓ(t) with
P*_ℓ(t) as previously discussed and introduces errors of the same
order of magnitude. That is, the approximation causes the
probability P_F(t) of system failure to be overestimated by an amount
of the order of P_F²(t).

(iii) Mathematical Details 

The following paragraphs describe in detail the mathematical
model as it is implemented in CARE III. As already mentioned,
the system to be modeled is assumed to consist of some number (up
to 70) of stages, with each stage composed of one or more identical
interchangeable elements or modules. The modules in each stage
are subject to up to five user-defined categories of faults. A
fault is characterized in terms of its rate of occurrence and in
terms of its coverage model parameters. Fault occurrence rates
are constrained to be of the form ωλt^(ω-1) (i.e., fault
distributions are constrained to be Weibull) with ω and λ user defined.
The user can also specify up to five sets of coverage model
parameters (α, β, ρ(t), ε(t), δ(t), C, P_A, P_B); each such set defines
a fault type. (Thus, for example, it is possible to define a
permanent fault type, α = 0; a transient type, α ≠ 0, β = 0; and
an intermittent type, α ≠ 0, β ≠ 0; each having its own characteristics
with regard to detectability, error-propagation, etc.) A fault
category then refers to a fault that can affect any module in
stage x; it is characterized by its occurrence-rate parameters λ
and ω and by j, a fault-type designator.
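The Weibull form of the occurrence rate can be illustrated with a
short sketch. The Python below is illustrative only; the function
names and sample parameter values are assumptions. It evaluates the
rate ωλt^(ω-1) and the corresponding probability exp(-λt^ω) that a
single module has not yet suffered a fault of the given category,
which follows from integrating the rate from 0 to t.

```python
import math

def weibull_fault_rate(lam, omega, t):
    """Fault occurrence rate omega * lam * t**(omega - 1), for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return omega * lam * t ** (omega - 1.0)

def weibull_no_fault_probability(lam, omega, t):
    """exp(-lam * t**omega): probability of no fault of this kind by time t."""
    return math.exp(-lam * t ** omega)

if __name__ == "__main__":
    lam, omega = 1.0e-4, 1.5  # hypothetical values, time in hours
    for t in (1.0, 10.0, 100.0):
        print(t, weibull_fault_rate(lam, omega, t),
              weibull_no_fault_probability(lam, omega, t))
```

Setting ω = 1 recovers the familiar constant rate λ; ω greater than 1
gives a rate that increases with time, and ω less than 1 one that
decreases.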


In addition, the user must specify the number of modules n
initially available at each stage, the minimum number m needed
for that stage to function properly, the various combinations of
stage failures that constitute a system failure, and the
probabilities b_xy(ν_x, ν_y) that a specific module in stage x forms a
critical pair with a specific module in stage y given that ν_x
stage-x modules and ν_y stage-y modules are known to have failed
and are therefore no longer being used.*


*These last two tasks are both accomplished with relative ease
through a CARE III user interface incorporating a program called
FTREE developed by Boeing Aircraft Co. and described in the CARE
III User's Manual.

On the basis of this user-supplied information, CARE III
then determines the system unreliability using the equation

$$\bar{R}(t) = 1 - R(t) = \sum_{\ell \in L} Q_\ell(t) + \sum_{\ell \in \bar{L}} P^*_\ell(t) \qquad (23)$$

with L the set of module failure combinations that would leave
the system operational in the absence of a coverage failure, L̄ the
complementary set, P*_ℓ(t) the probability that the system would
be in state ℓ at time t in the absence of a coverage failure, and


Q,(t) 



(xli-e *(t) (n -S, +1)X 

“ y j 


(t) 


(24) 


+ A' (t I (t) + a'(T|yP^*(T) 


dx 


This equation is seen to be identical to equation 17b with 


= A’ (t|« + a' (t|^) 


(2) 

X J.(t) 


2 




C (t 


i-e ) (n -Z +1) X (t) 

- Y y y y. 


and ^ ^ith the unit vector denoting a stage-y module. 


It will be recalled from equations 19 and 20 that λ^(1)(t) and
λ^(2)(t) are defined in terms of functions of the form

$$\int_0^t P_2(\tau)\, p_1(t-\tau)\, d\tau$$

with p_1(t) a measure of the rate at which a certain class of faults
occurs and P_2(τ) a function of the interval τ between that
occurrence and the entry of the fault into a particular coverage-model
state. Since, typically, faults occur at rates no greater than
one fault every several thousand hours, and since coverage-state
time constants are usually of the order of fractions of seconds
and rarely exceed a few minutes in duration, p_1(t) is a much more
slowly varying function of time than is P_2(τ). Thus, to a very
good approximation

$$p_1(t-\tau) \approx a(t) + \tau\, b(t) + \tau^2 c(t) \qquad (25)$$

over the range of τ for which P_2(τ) is not negligibly small, with
a(t), b(t), and c(t) suitably defined. This approximation is used
in CARE III with a(t), b(t), and c(t) defined to make the
approximation exact at the two end points and at the midpoint of the range
of interest of τ. The major advantage of introducing this
approximation is that, with it,

$$\int_0^t P_2(\tau)\, p_1(t-\tau)\, d\tau \approx a(t)\, m_2^0(t) + b(t)\, m_2^1(t) + c(t)\, m_2^2(t) \qquad (26)$$

with

$$m_2^k(t) = \int_0^t \tau^k P_2(\tau)\, d\tau$$

Thus, the convolution can be separated into two parts, one part
depending only on the reliability-model function p_1(t) and the
other involving only the first three moments of the coverage-model
function P_2(τ). Moreover, these moments need be evaluated only at
those points of time t relevant to the reliability model. This
significantly simplifies the interface between the coverage and
reliability models.
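The separation described by equations 25 and 26 can be checked
numerically. The Python below is a minimal sketch, not the CARE III
implementation: the two example functions, the integration routine,
and the choice of the range of interest (called `span` here) are all
assumptions made for illustration. It fits the quadratic so that it is
exact at τ = 0, span/2, and span, forms the three moments of the
coverage-model function, and compares the result with direct
numerical integration of the convolution over the same range.

```python
import math

def simpson(func, lo, hi, n=2000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0

def quadratic_coefficients(p1, t, span):
    """Return a, b, c so that a + tau*b + tau**2*c equals p1(t - tau)
    at tau = 0, span/2 and span."""
    h = span / 2.0
    y0, y1, y2 = p1(t), p1(t - h), p1(t - span)
    a = y0
    b = (4.0 * y1 - 3.0 * y0 - y2) / (2.0 * h)
    c = (y2 - 2.0 * y1 + y0) / (2.0 * h * h)
    return a, b, c

def convolution_by_moments(p1, P2, t, span):
    """Right-hand side of equation 26, with moments taken over [0, span]."""
    a, b, c = quadratic_coefficients(p1, t, span)
    moments = [simpson(lambda tau, k=k: tau ** k * P2(tau), 0.0, span)
               for k in range(3)]
    return a * moments[0] + b * moments[1] + c * moments[2]

def convolution_direct(p1, P2, t, span):
    """Direct numerical integration of P2(tau) * p1(t - tau) over [0, span]."""
    return simpson(lambda tau: P2(tau) * p1(t - tau), 0.0, span)

def p1(t):
    """Hypothetical slowly varying fault-occurrence rate (time in hours)."""
    lam, omega = 1.0e-4, 1.2
    return omega * lam * t ** (omega - 1.0)

def P2(tau):
    """Hypothetical coverage-model function with a 0.01-hour time constant."""
    return math.exp(-tau / 0.01)

if __name__ == "__main__":
    t, span = 1000.0, 0.2  # P2 is negligible beyond about 20 time constants
    print(convolution_by_moments(p1, P2, t, span),
          convolution_direct(p1, P2, t, span))
```

Because the moments depend only on P2, they can be computed once per
time point and reused for every fault-occurrence function p1, which is
the simplification the text describes.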

With these preliminaries, the reliability model functions
used in CARE III are itemized in Table 1 and the coverage model
functions in Table 2.

[Tables: Reliability Model Functions; Double-Fault Model Equations. Table note: t here is a measure of the time since the entry into state A.]

4. Concluding Remarks


It is, of course, obvious that the more reliable a system 
becomes, the more improbable are the events that cause it to fail. 
Accordingly, reliability models designed to estimate the reliabili- 
ty of such systems must necessarily take into account effects 
which could be ignored or only roughly approximated in models de- 
signed for less reliable structures. These effects are generally 
referred to as coverage effects; that is, effects that result in 
system failure due, not to an exhaustion of resources, but rather 
to faults that, while circumventable, are not detected and isolated
before they have caused the system as a whole to malfunction. 

CARE III is designed to allow the user to model coverage effects
to a level of detail heretofore impossible. To take full advantage of
this capability, the user must attempt to specify more completely just
how the effects of a fault make themselves manifest to the system.
In order to estimate the distribution of the time from the
occurrence of a fault to its detection, in particular, consideration
must be given to the frequency and thoroughness with which the
faulty module is tested. If the module is tested every τ seconds,
for example, and if the probability is unity that the fault is
detected if it is present when the test is conducted, then the
distribution of the time to detection is well modeled as d(t) = 1 - t/τ,
0 ≤ t ≤ τ. If, on the other hand, the module is tested at random
intervals with a less than certain outcome even if the fault is
present, a distribution of the form d(t) = e^(-δt) might be more
appropriate. Similar considerations are needed to select the other
relevant functions and parameters used in the CARE III coverage
model.
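The two testing regimes can be compared by simulation. The Python
sketch below is illustrative only. It reads d(t) as the probability
that a fault is still undetected t time units after it occurs, and it
models "random intervals" as a Poisson stream of tests; both are
assumptions of the sketch, as are the function names and sample
values. Under the Poisson assumption the exponent is δ = (test rate)
× (probability that a single test detects the fault).

```python
import math
import random

def periodic_detection_delay(period, rng):
    """Fault occurs at a uniformly random point in the test cycle and is
    detected with certainty at the next test."""
    time_since_last_test = rng.uniform(0.0, period)
    return period - time_since_last_test

def random_detection_delay(test_rate, hit_probability, rng):
    """Tests arrive as a Poisson stream; each one detects the fault with
    probability hit_probability."""
    elapsed = 0.0
    while True:
        elapsed += rng.expovariate(test_rate)
        if rng.random() < hit_probability:
            return elapsed

def empirical_undetected(samples, t):
    """Fraction of simulated faults still undetected after time t."""
    return sum(1 for s in samples if s > t) / len(samples)

if __name__ == "__main__":
    rng = random.Random(1)
    period, test_rate, hit_probability = 2.0, 2.0, 0.25  # hypothetical values
    periodic = [periodic_detection_delay(period, rng) for _ in range(50000)]
    randomized = [random_detection_delay(test_rate, hit_probability, rng)
                  for _ in range(50000)]
    delta = test_rate * hit_probability
    for t in (0.5, 1.0, 1.5):
        print(t,
              empirical_undetected(periodic, t), 1.0 - t / period,
              empirical_undetected(randomized, t), math.exp(-delta * t))
```

Each printed row pairs an empirical fraction with the corresponding
closed form, 1 - t/τ for periodic testing and e^(-δt) for randomized
testing, so the agreement can be inspected directly.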

In many cases, coverage model parameters may be difficult to
determine. Even in these cases, it is felt that CARE III can still
play a valuable role for two reasons: 1) It forces the user to
examine aspects of the system that might otherwise have been
ignored. 2) More importantly, it provides a means for determining
the sensitivity of the system's reliability to assumptions made
both about the behavior of faults and about the mechanisms
provided to recover from them.

Preliminary tests have shown that CARE III is indeed capable
of accurately estimating the reliability of a variety of systems
under a variety of conditions and assumptions (cf. Ref. 3). These
tests are being continued, both at Raytheon and elsewhere, and
will be reported on in greater detail later.